Abstract

I present the first algorithm for stochastic finite-armed bandits that simultaneously enjoys order-optimal
problem-dependent regret and worst-case regret. Besides the theoretical results, the new algorithm
is simple, efficient and empirically superb. The approach is based on UCB, but with a carefully
chosen confidence parameter that optimally balances the risk of failing confidence intervals against the
cost of excessive optimism.
1 Introduction


Finite-armed bandits are the simplest and most well-studied reinforcement learning setting where an agent 
must carefully balance exploration and exploitation in order to act well. This topic has seen an explosion 
of research over the past half-century, perhaps starting with the work by Robbins [1952]. While early 
researchers focussed on asymptotic results [Lai and Robbins, 1985, and others] or the Bayesian setting 
[Bradt et al., 1956, Gittins, 1979], recently the focus has shifted towards optimising finite-time frequentist
guarantees and empirical performance. Despite the growing body of research there are still fundamental
open problems, one of which I now close. 

I study the simplest setting with $K$ arms and a subgaussian noise model. In each time step $t$ the learner
chooses an action $I_t \in \{1,\ldots,K\}$ and receives a reward $\mu_{I_t} + \eta_t$ where $\mu_i$ is the unknown expected reward
of arm $i$ and the noise term $\eta_t$ is sampled from some 1-subgaussian distribution that may depend on $I_t$. For
notational convenience assume throughout that $\mu_1 > \mu_2 \ge \cdots \ge \mu_K$ and define $\Delta_i = \mu_1 - \mu_i$ to be the
gap between the expected means of the $i$th arm and the optimal arm.$^1$ The pseudo-regret of a strategy $\pi$ is
the difference between the expected rewards that would be obtained by the omnipotent strategy that always
chooses the best arm and the expected rewards obtained by $\pi$.


\[ R^\pi_\mu(n) = n\mu_1 - \mathbb{E}\left[\sum_{t=1}^n \mu_{I_t}\right], \]


where $n$ is the horizon, $I_t$ is the action chosen at time step $t$ and the expectation is taken with respect to the
actions of the algorithm and the random rewards.

1 This assumes the existence of a unique optimal arm, which is for mathematical convenience only. All regret bounds will hold
with natural obvious modifications if multiple optimal arms are present.

There are now a plethora of algorithms with strong regret
guarantees, the simplest of which is the Upper Confidence Bound (UCB) algorithm by Agrawal [1995],
Katehakis and Robbins [1995] and Auer et al. [2002].$^2$ It satisfies


\[ R^{\text{ucb}}_\mu(n) \in O\left(\sum_{i=2}^K \frac{1}{\Delta_i}\log(n)\right). \tag{1} \]


This result is known to be asymptotically order-optimal within a class of reasonable algorithms [Lai and Robbins, 
1985]. But there are other measures of optimality. When one considers the worst-case regret, it can be shown 
that 


\[ \sup_\mu R^{\text{ucb}}_\mu(n) \in \Omega\left(\sqrt{nK\log n}\right). \]


Quite recently it was shown by Audibert and Bubeck [2009] that a modified version of UCB named MOSS
enjoys a worst-case regret of

\[ \sup_\mu R^{\text{moss}}_\mu(n) \in O\left(\sqrt{nK}\right), \]

which improves on UCB by a factor of order $\sqrt{\log n}$ and matches up to constant factors the lower bound
given by Auer et al. [1995]. Unfortunately MOSS is not without its limitations. Specifically, one can construct
regimes where the problem-dependent regret of MOSS is much worse than UCB. The improved UCB
algorithm by Auer and Ortner [2010] bridges most of the gap. It satisfies a problem-dependent regret that
looks similar to Eq. (1) and a worst-case regret of

\[ \sup_\mu R^{\text{improved ucb}}_\mu(n) \in O\left(\sqrt{nK\log K}\right), \]

which is better than UCB, but still suboptimal. Even worse, the algorithm is overly complicated and
empirically hopeless. Thompson sampling, originally proposed by Thompson [1933], has gained enormous
popularity due to its impressive empirical performance [Chapelle and Li, 2011] and recent theoretical
guarantees [Kaufmann et al., 2012b, Korda et al., 2013, Agrawal and Goyal, 2012a,b, and others]. Nevertheless,
it is known that when a Gaussian prior is used, it also suffers an $\Omega(\sqrt{nK\log K})$ regret in the worst-case
[Agrawal and Goyal, 2012a].

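My contribution is a new algorithm called Optimally Confident UCB (OCUCB), as well as theoretical
analysis showing that

\[ \sup_\mu R^{\text{ocucb}}_\mu(n) \in O\left(\sqrt{Kn}\right) \quad\text{and}\quad R^{\text{ocucb}}_\mu(n) \in O\left(\sum_{i=2}^K \frac{1}{\Delta_i}\log\left(\frac{n}{H_i}\right)\right), \qquad H_i = \sum_{j=1}^K \min\left\{\frac{1}{\Delta_i^2}, \frac{1}{\Delta_j^2}\right\}. \]

The new algorithm is based on UCB, but uses a carefully chosen confidence parameter that correctly
balances the risk of failing confidence intervals against the cost of excessive optimism. In contrast, UCB is too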
conservative, while MOSS is sometimes not conservative enough. The theoretical results are supported by 
experiments showing that OCUCB typically outperforms existing approaches (Appendix H). Besides this I 
also present a kind of non-asymptotic problem dependent lower bound that almost matches the upper bound 
(Appendix F). 

2 Agrawal [1995] and Katehakis and Robbins [1995] both proved asymptotic results for algorithms based on upper confidence 
bounds, while Auer et al. [2002] focussed on finite-time bounds. 
2 Notation, Algorithm and Theorems


Let $\hat\mu_{i,s}$ be the empirical estimate of the reward of arm $i$ based on the first $s$ samples from arm
$i$ and $\hat\mu_i(t)$ be the empirical estimate of the reward of arm $i$ based on the samples observed
until time step $t$ (non-inclusive). Define $T_i(t)$ to be the number of times arm $i$ has been chosen
up to (not including) time step $t$. The algorithm accepts as parameters the number of arms, the
horizon, and two tunable variables $\alpha > 2$ and $\psi > 2$. The function $\log_+$ is defined by
$\log_+(x) = \max\{1, \log(x)\}$. A table of notation is available in Appendix I.


Input: $K$, $n$, $\alpha$, $\psi$
Choose each arm once
for $t \in K+1, \ldots, n$ do
    Choose $I_t = \arg\max_i \hat\mu_i(t) + \sqrt{\frac{\alpha}{T_i(t)}\log\left(\frac{\psi n}{t}\right)}$
end

Algorithm 1: Optimally Confident UCB
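The index used in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `ocucb_index` and `choose_arm`, the 0-based arm numbering, and the default values `alpha=3.0` and `psi=3.0` are assumptions made for the example (the text only reports that $\alpha = 3$ worked well and that the algorithm is insensitive to $\psi$).

```python
import math


def ocucb_index(mean_hat, pulls, t, n, alpha, psi):
    """Empirical mean plus the bonus sqrt(alpha / T_i(t) * log(psi * n / t))."""
    return mean_hat + math.sqrt(alpha / pulls * math.log(psi * n / t))


def choose_arm(means_hat, counts, t, n, alpha=3.0, psi=3.0):
    """Return the 0-based arm with the largest index.

    Assumes every arm has already been pulled once (counts[i] >= 1)
    and that K + 1 <= t <= n, as in the main loop of Algorithm 1.
    """
    indices = [ocucb_index(m, c, t, n, alpha, psi)
               for m, c in zip(means_hat, counts)]
    return max(range(len(indices)), key=indices.__getitem__)
```

For a fixed pull count the bonus shrinks as $t$ grows, because $\log(\psi n/t)$ is decreasing in $t$; this is the opposite of the $\log t$ bonus of UCB.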


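Theorem 1. If $\Delta_K < 1$ and $\alpha > 2$ and $\psi > 2$, then there exists a constant $C_1(\alpha,\psi)$ depending only on $\alpha$
and $\psi$ such that

\[ R^{\text{ocucb}}_\mu(n) \le \sum_{i=2}^K \frac{C_1(\alpha,\psi)}{\Delta_i}\log_+\left(\frac{n}{H_i}\right). \]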

Theorem 2. If $\Delta_K < 1$ and $\alpha > 2$ and $\psi > 2$, then there exists a constant $C_2(\alpha,\psi)$ depending only on $\alpha$
and $\psi$ such that

\[ \sup_\mu R^{\text{ocucb}}_\mu(n) \le C_2(\alpha,\psi)\sqrt{nK}. \]
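To get a feel for the shape of the problem-dependent bound, the following sketch evaluates $H_i$ and the sum $\sum_{i \ge 2} \Delta_i^{-1}\log_+(n/H_i)$ for a given gap vector and compares it with the UCB-style sum $\sum_{i \ge 2} \Delta_i^{-1}\log(n)$. The unspecified constant $C_1(\alpha,\psi)$ is left out, the helper names are hypothetical, and the convention that `gaps[0] = 0` denotes the optimal arm (contributing $1/\Delta_i^2$ to $H_i$) is an assumption of the example.

```python
import math


def log_plus(x):
    """log_+(x) = max{1, log(x)}."""
    return max(1.0, math.log(x))


def arm_hardness(gaps, i):
    """H_i = sum_j min(1/gap_i^2, 1/gap_j^2); a zero gap counts as 1/gap_i^2."""
    inv_i = 1.0 / gaps[i] ** 2
    return sum(inv_i if g == 0 else min(inv_i, 1.0 / g ** 2) for g in gaps)


def problem_dependent_sum(gaps, n):
    """sum_{i >= 2} (1/gap_i) * log_+(n / H_i), without the constant C_1."""
    return sum(log_plus(n / arm_hardness(gaps, i)) / gaps[i]
               for i in range(1, len(gaps)))


def ucb_style_sum(gaps, n):
    """sum_{i >= 2} (1/gap_i) * log(n), the shape of Eq. (1)."""
    return sum(math.log(n) / g for g in gaps[1:])
```

With three arms and gaps `[0, 0.5, 0.5]`, each suboptimal arm has $H_i = 12$, so the logarithm is taken of $n/12$ instead of $n$.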

I make no effort to reduce the constants appearing in the regret bounds and for this reason they are left
unspecified. Instead, I focus on maximising the range of the tunable parameters for which the algorithm
is provably order-optimal, both asymptotically and in the worst-case. The functions $C_1$ and $C_2$ have a
complicated structure, but satisfy

\[ \forall i \in \{1,2\}: \quad \lim_{\alpha\to\infty} C_i(\alpha,\psi) = \infty \quad\text{and}\quad \lim_{\alpha\searrow 2} C_i(\alpha,\psi) = \infty \quad\text{and}\quad \lim_{\psi\to\infty} C_i(\alpha,\psi) = \infty. \]

It is possible to improve the range of $\psi$ to $\psi > 1$ rather than $\psi > 2$, but this would complicate an
already complicated proof. The algorithm is very insensitive to $\psi$ and $\alpha = 3$ led to consistently excellent
performance. A preliminary sensitivity analysis may be found in Appendix H. Both theorems depend on
the assumption that $\Delta_K < 1$. The assumption can be relaxed without modifying the algorithm, and with an
additive penalty of $\sum_{i=2}^K \Delta_i$ on the regret. This is due to the fact that any reasonable algorithm must
choose each arm at least once.

The main difficulty in proving Theorems 1 and 2 is that the exploration bonus is simultaneously quite 
small and negatively correlated with t, while for UCB it is positively correlated. A consequence is that the 
analysis must show that $t$ does not get too large relative to $T_1(t)$ since otherwise the exploration bonus for
the optimal arm may become too small. 


2.1 The Near-Correctness of a Conjecture 

It was conjectured by Bubeck and Cesa-Bianchi [2012] that the optimal regret might be

\[ R_\mu(n) \in O\left(\sum_{i=2}^K \frac{1}{\Delta_i}\log\left(\frac{n}{H}\right)\right), \tag{2} \]

where $H = \sum_{i=2}^K \Delta_i^{-2}$ is a quantity that appears in the best-arm identification literature [Bubeck et al.,
2009, Audibert and Bubeck, 2010, Jamieson et al., 2014]. Unfortunately this result is not attainable.
Assume a standard Gaussian noise model and let $\mu_1 = 1/2$ and $\mu_2 = 1/2 - 1/K$ and $\mu_i = 0$ for $i > 2$, which
implies that $H = 4(K-2) + K^2 > n = K^2$. Suppose $\pi$ is some policy satisfying $R^\pi_\mu(n) \in o(K\log K)$,
which must be true for any policy witnessing Eq. (2). Then


\[ \min_{i>2}\mathbb{E}[T_i(n+1)] \in o(\log K). \]


Let $i = \arg\min_{i>2}\mathbb{E}[T_i(n+1)]$ and define $\mu'$ to be equal to $\mu$ except for the $i$th coordinate, which has
$\mu'_i = 1$. Let $I = \mathbb{1}\{T_i(n+1) \ge n/2\}$ and let $P$ and $P'$ be measures on the space of outcomes induced by
the interaction between $\pi$ and environments $\mu$ and $\mu'$ respectively. Then for all $\varepsilon > 0$,

\[ R^\pi_\mu(n) + R^\pi_{\mu'}(n) \ge \frac{n}{4}\left(P\{I=1\} + P'\{I=0\}\right) \overset{(a)}{\ge} \frac{n}{8}\exp\left(-\mathrm{KL}(P,P')\right) \overset{(b)}{=} \frac{K^2}{8}\exp\left(-\frac{\mathbb{E}[T_i(n+1)]}{2}\right) \in \omega(K^{2-\varepsilon}), \]


where (a) follows from Lemma 2.6 by Tsybakov [2008] and (b) by computing the KL divergence between $P$
and $P'$, which follows along standard lines [Auer et al., 1995]. By the assumption on $R^\pi_\mu(n)$ and for suitably
small $\varepsilon$ we have

\[ R^\pi_{\mu'}(n) \in \omega(K^{2-\varepsilon}). \]

But this cannot be true for any policy satisfying Eq. (2) or even Eq. (1). Therefore the conjecture is not true.
For the example given, $R^\pi_\mu(n) \in \Omega(K\log K)$ is necessary for any policy with sub-linear regret in $\mu'$, which
matches the regret given in Theorem 1.

More intuitively, if Eq. (2) were true, then the existence of a single barely suboptimal arm would significantly
improve the regret relative to a problem without such an arm, which does not seem very plausible.
The bound of Theorem 1, on the other hand, depends less heavily on the smallest gap and more on the number
of arms that are nearly optimal. There are situations where the conjecture does hold. Specifically, when
$H_i = H$, which is often approximately true (e.g., if all suboptimal arms have the same gap, but this is not
the only case). I believe the bound given in Theorem 1 is essentially the right form of the regret. Matching
lower bounds are given in specific cases in Appendix F along with a generally applicable lower bound that
is fractionally suboptimal.
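As a worked check of the example above (a sketch that uses the definition $H_i = \sum_{j=1}^K \min\{\Delta_i^{-2}, \Delta_j^{-2}\}$ from the introduction, with the assumed convention $\Delta_1^{-2} = \infty$ for the optimal arm), take $\Delta_2 = 1/K$ and $\Delta_i = 1/2$ for $i > 2$. Then

\[ H = \Delta_2^{-2} + \sum_{i>2}\Delta_i^{-2} = K^2 + 4(K-2), \qquad H_i = \underbrace{4}_{j=1} + \underbrace{4}_{j=2} + \underbrace{4(K-2)}_{j>2} = 4K \quad (i > 2). \]

With $n = K^2$ this gives $\log(n/H) < 0$, while $\log(n/H_i) = \log(K/4)$ for $i > 2$, so that

\[ \sum_{i>2}\frac{1}{\Delta_i}\log_+\left(\frac{n}{H_i}\right) = 2(K-2)\max\left\{1, \log\frac{K}{4}\right\}, \]

which is of order $K\log K$ and agrees with the $\Omega(K\log K)$ requirement derived for this example.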

3 Proof of Theorem 1 

The proof is separated into four components. First I introduce some new notation and basic algebraic results
that will hint towards the form of the regret. I then derive the required concentration results showing that
the empirical estimates of the means lie sufficiently close to the true values. These are used to define a set
of failure events that occur with low probability. Then the number of times a suboptimal arm is pulled is
bounded under the assumption that a failure event does not occur. Finally all components are combined
with a carefully chosen regret decomposition. Throughout the proof I introduce a number of non-negative
constants denoted by $\gamma, c_\gamma, c_1, c_2, \ldots, c_{11}$ that must satisfy certain constraints, which are listed and analysed
in Appendix D. Readers wishing to start with a warm-up may enjoy reading Appendix G where I give a
simple and practical algorithm with the same regret guarantees as improved UCB, but with an easy proof
relying only on existing techniques and a well-chosen regret decomposition.
Part 0: Setup

I start by defining some new quantities. 


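\[ \delta_T = \min\left\{\frac{1}{2},\ \frac{c_6}{n}\sum_{i=1}^K \min\{u_i, T\}\right\}, \qquad u_i = u_{\Delta_i}, \qquad u_\Delta = \frac{c_9}{\Delta^2}\log\left(\frac{c_{10}}{\delta_\Delta}\right), \tag{3} \]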


where $c_6$, $c_9$ and $c_{10}$ are constants to be chosen subsequently (described in Appendix D). A convenient (and
slightly abusive) notation is $\delta_\Delta = \delta_{u_\Delta}$. It is easy to check that $u_\Delta$ and $\delta_\Delta$ are monotone non-increasing.
Note that these definitions are mutually dependent, so the quantities must be extracted by staring at the relations.
We shall gain a better understanding of $u_\Delta$ and $\delta_\Delta$ later when analysing the regret. For now it is best to
think of $u_\Delta$ as a $(1 - \delta_\Delta)$-probability bound on the number of times a $\Delta$-suboptimal arm will be pulled. The
following inequalities follow from straightforward algebraic manipulation.


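Lemma 3. $\sum_{i=2}^K \Delta_i u_i \le c_9\left(1 + \log(c_{10})\right)\sum_{i=2}^K \frac{1}{\Delta_i}\log_+\left(\frac{n}{H_i}\right)$.

Lemma 4. $\delta_T \le \delta_{T+1}$ and if $T \le S$, then $\delta_S \le \delta_T \cdot S/T$.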

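Lemma 5. Let $\gamma \in (1, \alpha/2)$ and $c_\gamma$, $c_5$ be as given in Appendix D and define $\tilde\delta_\Delta$ by

\[ \tilde\delta_\Delta = c_\gamma\left(\delta_\Delta + \sum_{k=0}^{k^*-1}\delta_{\gamma^{k+1}}\right), \qquad k^* = \min\left\{k : \gamma^{k+1} \ge \frac{1}{\Delta^2}\log\frac{1}{\delta_\Delta}\right\}. \tag{4} \]

Then

\[ \tilde\delta_\Delta \le \frac{c_5}{n}\left(\sum_{i : u_i \ge u_\Delta} u_\Delta + \sum_{i : u_i < u_\Delta} u_i\log\left(\frac{u_\Delta}{u_i}\right)\right). \]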



Part 1: Concentration 

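Lemma 6. Let $X_1, X_2, \ldots$ be sampled i.i.d. from some 1-subgaussian distribution and let $\hat\mu_t = \sum_{s=1}^t X_s/t$
be the empirical mean based on the first $t$ samples. Suppose $\beta \ge 1$. Then for all $\Delta > 0$,

\[ \mathbb{P}\left\{\exists t : |\hat\mu_t| \ge \sqrt{\frac{2\gamma\beta}{t}\log\frac{1}{\delta_t}} + c_4\Delta\right\} \le \tilde\delta_\Delta 2^{-\beta}. \]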

The proof may be found in Appendix A and is based on a peeling argument combined with Doob's maximal
inequality (e.g., as was used by Audibert and Bubeck [2009], Bubeck [2010] and elsewhere). Define
$\beta_{i,\Delta} \ge 1$ by

\[ \beta_{i,\Delta} = \min\left\{\beta \ge 1 : (\forall t)\ |\hat\mu_{i,t} - \mu_i| \le \sqrt{\frac{2\gamma\beta}{t}\log\frac{1}{\delta_t}} + c_4\Delta\right\}. \tag{5} \]

Note that for fixed $\Delta$ the random variables $\beta_{i,\Delta}$ with $i \in \{1, \ldots, K\}$ are (mutually) independent. Furthermore,
if $i$ is fixed, then $\beta_{i,\Delta}$ is non-increasing as $\Delta$ increases.

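Lemma 7. $\mathbb{P}\{\beta_{i,\Delta} > 1\} \le \tilde\delta_\Delta$ and $\mathbb{E}[\beta_{i,\Delta} - 1] \le 2\tilde\delta_\Delta$.

Lemma 8. For all $\Delta, \Delta' > 0$,

\[ \mathbb{P}\left\{\sum_{i : \Delta_i \ge \Delta'}\beta_{i,\Delta}u_i\Delta_i > 2\sum_{i : \Delta_i \ge \Delta'}u_i\Delta_i\right\} \le 2\tilde\delta_\Delta. \]

Lemma 9.

\[ \mathbb{P}\left\{\exists T : \sum_{i : \beta_{i,\Delta} = 1}\min\{u_i, T\} < \frac{1}{5}\sum_{i=1}^K\min\{u_i, T\}\right\} \le 24\tilde\delta_\Delta. \]

Lemma 10.

\[ \mathbb{P}\left\{\exists T : \sum_{i=1}^K\beta_{i,\Delta}\min\{u_i, T\} > 67\sum_{i=1}^K\min\{u_i, T\}\right\} \le 13\tilde\delta_\Delta. \]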


Lemma 7 follows from Lemma 6. Lemma 8 follows from Lemma 7 via Markov’s inequality. The proofs 
of Lemmas 9 and 10 are given in Appendix B, with the only difficulty being the uniformity over T and 
because a naive application of the union bound would lead to an unpleasant dependence on K or n. Both 
results would follow trivially from Markov’s inequality for fixed T. 


Part 2: Failure Events 

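For each $\Delta > 0$, define $F_\Delta \in \{0,1\}$ to be the event that one of the following does not hold: (6)

(C1) : $\beta_{1,\Delta} = 1$

(C2) : $\sum_{i : \Delta_i \ge c_8\Delta}\beta_{i,\Delta}u_i\Delta_i \le 2\sum_{i : \Delta_i \ge c_8\Delta}u_i\Delta_i$

(C3) : $\forall T$ : $\sum_{i : \beta_{i,\Delta} = 1}\min\{u_i, T\} \ge \frac{1}{5}\sum_{i=1}^K\min\{u_i, T\}$

(C4) : $\forall T$ : $\sum_{i=1}^K\beta_{i,\Delta}\min\{u_i, T\} \le 67\sum_{i=1}^K\min\{u_i, T\}$.

By Lemmas 7, 8, 9 and 10 we have $\mathbb{P}\{F_\Delta\} \le (2 + 2 + 24 + 13)\tilde\delta_\Delta = 41\tilde\delta_\Delta$. Define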

\[ \tilde\Delta = \sup\{\Delta : F_\Delta = 1\}. \tag{7} \]

From the definition of $\beta_{i,\Delta}$ we have that $F_\Delta = 1$ for all $\Delta < \tilde\Delta$ and $F_\Delta = 0$ for all $\Delta > \tilde\Delta$. We will shortly
see that the algorithm will quickly eliminate arms with gaps larger than $\tilde\Delta$, while arms with gaps smaller
than $\tilde\Delta$ may be chosen linearly often.

Part 3: Bounding the Pull Counts 

This section contains the most important component of the proof, which is bounding $T_j(n+1)$ for arms
$j$ with $\Delta_j$ larger than a constant factor times $\tilde\Delta$. I abbreviate $\beta_i = \beta_{i,\tilde\Delta}$ for this part. The proof is rather
involved, so I try to give some intuition. We need to show that if $T_j(t) = \lceil\beta_j u_j\rceil$, then the errors of the
empirical estimates of the return of arm $j$ and arm 1 are both around $\Delta_j$ and that the bonus for arm $j$ is also
not significant. To do this we will show that the pull-counts of near-optimal arms are at least a constant
proportion of $T_j(t)$ and it is this that presents the most difficulty.
Lemma 11. Let $t$ be some time step and $i, j$ be arms such that:

1. $\beta_i = 1$  2. $c_2\beta_j T_i(t) \le T_j(t)$  3. $\psi n/t \ge 1/\delta_{T_i(t)}$  4. $c_1 T_i(t) \le \min\{u_i, u_{\tilde\Delta}\}$. Then $I_t \ne j$.

Proof. Arm $j$ is not played if arm $i$ has a larger index.


A i(t) 


(b) 

> Pj + 


a 


Ti(t) 


log 


ibn\ (“) 


a 


m 


log 


/ ipn\ 

{—) 


a 


m ) 


log(^)- 


/ 2 7 

ri(t) 


log 




/ 2 7 

V 

— A* — C4A 


log 




— C 4 A 


( c ) „ . , 

> pj (t) + 


(d) 

> Aj(i) + 


( e ) „ , , 

> pj(t) + 


a 


m ) 


log 


i/m\ 

~T J “ 


I 2 7 

^<(t) 


log 


^(t) 


^ 2 K 0 


log 


1 


-A j - 2 c 4 A 


a 


r<(‘) 


log 


t 


[2^ 


C 2 


2 i(t) 


log 


%(t) 


-A j - 2 c 4 A 


a 




log 


fm 

t 


+ 1 - 


a 


Ti(«) 


logl^ 


-I^+M 


log 


C2 J V Ti(t) \STi(t) 


- Aj - 2 c 4 A 


> £tj(i) + 


+ 


(s) , . , 

> + 


(/i) 

> Aj(0 + 


a 


2 }(t) 


log 


?/>n 


a — 


-y/27- 



^(t) 


log 


<%(t) 


a 


Tj (t) 


log 


rfm 

t 


+ max 


1 ci 




log 


1 


' — log 

U A 


- At - 2c 4 A 

- A* - 2c 4 A 


1 


a 


r,(t) 


log 


ifm 

t 


where (a) follows since /?* = 1 and = 0, (b) since pi = pi — A* > pj — (c) since = 0, (d) 

and (e) since c 2 f3jTi(t) < Tj(t ) and because 5t is monotone non-decreasing, (f) since fm/t > l/dj^.u) 
is assumed, (g) from the constraint on C 2 (Const9) and because Tft) < min {u,/ci. u^/c.\ }, (h) from the 
constraint on ci (ConstlO) and from max {x, y} > x/2 + y/2. □ 


Lemma 12. Let $t$ be some time step and $j$ be an arm such that:

1. $\Delta_j \ge c_8\tilde\Delta$  2. $T_j(t) = \lceil\beta_j u_j\rceil$  3. $c_2\beta_j T_1(t) \ge T_j(t)$ or $T_1(t) \ge u_{\tilde\Delta}/c_1$  4. $\psi n/t \le c_7\beta_j/\delta_{u_j}$.

Then $I_t \ne j$.
Proof. Using a similar argument as in the proof of the previous lemma.


At (t) + 


($) 

~ Pj + 


a 


W) 


. , ibn\ ( a ) 

log > pi + 


1 a f ipn\ 


I 2 7 

W) 


log 


’h(t) 


— C 4 A 


a 


^i(t) 


logl^l- 


I 2 7 

^i(t) 


log 


1 




+ A j - c 4 A 


(c) ^ 

> + 


> pj(t ) + 


— max 


(e) W X 

> Pj(t ) + 


(/) „ . . 

> + 

(s) „ . , 

> /Tj(i) + 


> pj(t ) + 


(i) 

> pj (t) + 


/ W) % 1 

{—)' 

/r l(t ) lu6 ' 

( fn\ 

\TJ 



( ijm\ 

V t ) 

1 / m ^ 

f ifri \ 

/ r jW 1 

( f>n\ 

uJ 

i m lus 

/ fn 7 

w 

/ *>«) lufe l 

f i/m\ 

K~)' 


1 27 


log 


5 A(t) 


2 7/?j 

^ T;(f) 


log 




+ A,- - 2c 4 A 


\ 


27C1 


Cl 


log 1— 


- t / — lo § ( ) + A i - 2c 4 A 


+ Aj/2 — (2 + c 3 )c 4 A 


1 m lufe 

+ Aj/2 - (2 + c 3 )c 4 A 

/ t jW 1 

^ ^ ^ + Aj/2 (2 + c 3 )c 4 A 

/ lub l 

” ^ + Aj/2 (2 + c 3 )c 4 Aj/c 8 


where (a) follows since /3 4 = 1 and because I f = 0, (b) since p\ = pj + Aj, (c) by the definition of (3j 
and because = 0 does not hold, (d) by the assumption that cifjT\(t) > Tj(t ) or ciTi(t) > and 
because Tj(t) = \PjUj] and Lemma 4, (e) by the constraints on Uj (Const 12,Const 13) and on c\ (Constl 1). 
(f) is trivial, (g) since we assumed ipn/t < cq j S u ., (h) since Aj > eg A, (i) from constraints (Constl4) and 
(Const7). □ 


Lemma 13. If $\Delta_j \ge c_8\tilde\Delta$, then $T_j(n+1) \le \lceil\beta_j u_j\rceil$.


Proof. We need two results for all $t$:

1. $\beta_i = 1 \implies t \le \psi n\delta_{T_i(t)}$ or $c_1 T_i(t) \ge \min\{u_i, u_{\tilde\Delta}\}$.

2. $c_7\beta_j t \ge \psi n\delta_{T_j(t)}$.

Both are trivial for $t = K + 1$. Assume 1. and 2. hold for all $s < t$. Then


(a) 


K 


( 6 ) 


(c) 


c 7 /3jt > c 7 Pj^Ti{t) > c 7 f5j ^2 1 H*) > c 7/% X] 


min 


rq Tj (t ) 1 

Cl' Cl ’ c 2 /3j 


»=1 i'-Pi =1 v 3 

w c 7 /3, ^ . f (e) c 7 /3j ^ . f 7)(f)-l 

> —— z. mm i Ui > u a’ —r - tttz z. mm i^’ 


Ci +C2 ‘ 

i'-Pi=l 


Cl +C2 ' 

*:p; = l 


Pi 


(/) C 7 V-^ • f rp (3) C 7 


ir 


> 


Cl + c 2 


^ min {u,i,Tj(t) — 1} > 


i:/3i=l 


5(ci + C 2 ) 


y^ min {rq, ?)(()} - K 


\ 2—1 


> 


c 7 n 


5c 6 (ci+c 2 ) V n 


’Tj® 


c e K\ W c 7 n 


> 


O') 


lOcefc+cZ-M - ^<‘> ’ 


where (a) and (b) are trivial, (c) follows from Lemma 11 and the assumption that 1. and 2. hold for all 
s < t. (d) follows since c\. c 2 > 1. (e) since Tj(t)//3j < Uj < by Lemma 12 and the assumption that 
A y > c 8 A. (f) is trivial, (g) follows from condition (C3). (h) follows from the definition of S Tj (t) • (i) by 
naively bounding ^(t) an d (j) by the definition of c 7 (Const2). Therefore 2. holds also for t. Now suppose 
j3i = 1 and C] T, (f) < min u^}. Then by Lemma 11 we have for any k that T k (t) < cptikTjit) + 1. 
If Afc > c 8 A, then T k (t) < /3kU k + 1- On the other hand, if A*. < c 8 A. Then Tk{t) < c 2 f3kTi(t) + 1 < 
C2/3fc«A + 1 < c 2 cl/3kUk + 1. Therefore 

K K 


t - 1 + r fc(*) < -^ + 1 + c 2 cf /3 fc min {Tj(f), rt fc } 


(b) 


fc=l 


fc=l 


(c) 


if 


< K + 1 + 67c 2 c| ^ min {Tj(t), u k } = K + 1 + 


67c 2 Cg 


n<5- 


fc=i 


C6 


m) 


(0 134c2c| (/) 

< — nd Ti{t) = ipnd T . {t) , 

where (a) is trivial, (b) follows from the reasoning above the display and naively choosing largest possible 
constant, (c) follows from condition (C4) in the definition of the failure event, (d) by substituting the 
definition of <%(*)• ( e ) by naively bounding drpL) and noting that T,(t) > 1. (f) is the definition of c% 
(Constl). Therefore 1. and 2. hold for all t and so by Lemmas 11 and 12 we have Tj(t) < \/3jUj~\ as 
required. □ 


Part 4: Regret Decomposition 

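Let $R = n\mu_1 - \sum_{t=1}^n \mu_{I_t}$ be the pseudo-regret (this is a random variable because there is no expectation on
$I_t$). From the previous section, if $\Delta_i \ge c_8\tilde\Delta$, then $T_i(n+1) \le \lceil\beta_{i,\tilde\Delta}u_i\rceil$. Therefore

\[ R \le c_8 n\tilde\Delta\cdot\mathbb{1}\{c_8\tilde\Delta \ge \Delta_2\} + \sum_{i : \Delta_i \ge c_8\tilde\Delta}\Delta_i\lceil\beta_{i,\tilde\Delta}u_i\rceil \le c_8 n\tilde\Delta\cdot\mathbb{1}\{c_8\tilde\Delta \ge \Delta_2\} + 3\sum_{i=2}^K\Delta_i u_i, \]

which follows from the definition of the failure event (C2) and naive simplification. Therefore

\[ R^{\text{ocucb}}_\mu(n) = \mathbb{E}R \le 3\sum_{i=2}^K\Delta_i u_i + c_8 n\mathbb{E}\left[\tilde\Delta\cdot\mathbb{1}\{c_8\tilde\Delta \ge \Delta_2\}\right]. \tag{8} \]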
All that remains is to bound the expectation, starting with an easy lemma.

Lemma 14. J u/\dA < ^2 + log an ^ J Ui ^°S ( y ~^j ^ A — 

Proof By straight-forward calculus and Lemma 17 in Appendix C. 

£ UAdA =£ $ log (a) dA =£ ^ log (£' t:) dA 

£> g (££KfK 2+log (£ 

( — ^ dA< f m log ( dA = 2AiUi. 


< 


For the second part Lemma 18 gives / Ui log 


\ u i J 


A 2 


Lemma 15. nE 


Ai{a > A 2 /c 8 } 


K 


< Cll A i u i■ 

1=2 


Proof. Preparing to use the previous lemma. 

A 2 


E 


AIK A > 


C8 


<—f{a> — i + 

^8 l C 8 J J A 2 /cs 
< A £a 2 /c a | ' 0 ° 


[ p{a>a) 

J A2/cfi ^ ' 


dA 


f 

J A 


5aAA 


c 8 J A 2 /c 8 

Bounding each term separately. First, by Lemmas 5 and 18 we have 

^2<5a 2 /c g 1 C5A 2 ( ST-' , f u A 2 /c s 

- <-I UA 2 /c 8 + 2 ^ Ui l0g 


C8 


n c 8 

1 


i=2 

K 


i=2 


Ui 

c|_A? 

A 


< - • c 5 c 8 A 2 ( ua 2 + 2 log ( At ) - n' 2(1+log ^ c8 )) c5C8 A ^ 


ir 


i=2 


For the second term, using Lemma 5 again, as well as Lemma 14 


□ 


f §AdA < — [ I V' «&+ V] Uilog ( — ) \ dA 

■Ws n J A 2 /C8 ^ 


C5 

n 


i:ui<fu/± 

K „a, 


poo POO /• 

/ UA<iA + ^2 UAdA + y^ / 

J A2/C8 »_o */ Ai „■ o J 0 


i=2 J Ai 


i=2 J0 


; l°g 


< £5 / C 9 


n l A 2 /c 8 


2 + log 


cio 




Ui 


i=2 


The result follows by choosing $c_{11} = 2c_5c_8(1 + \log(c_8)) + c_5(3c_8\log(c_8) + 5)$.


□ 
And with this we have the final piece of the puzzle. Substituting Lemma 15 into Eq. (8):

\[ R^{\text{ocucb}}_\mu(n) \in O\left(\sum_{i=2}^K\Delta_i u_i\right). \]

Collecting the constants and applying Lemma 3 leads to

\[ R^{\text{ocucb}}_\mu(n) \le (3 + c_{11})\sum_{i=2}^K\Delta_i u_i \le c_9(3 + c_{11})(1 + \log(c_{10}))\sum_{i=2}^K\frac{1}{\Delta_i}\log_+\left(\frac{n}{H_i}\right). \]

The result is completed by choosing $C_1(\alpha,\psi) = c_9(3 + c_{11})(1 + \log(c_{10}))$.


4 Proof of Theorem 2 


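The proof follows exactly as the proof of Theorem 1, but bounding the regret due to arms with $\Delta_i \le \sqrt{K/n}$
by $\sqrt{Kn}$. Then

\[ R^{\text{ocucb}}_\mu(n) \le \sqrt{Kn} + \sum_{i : \Delta_i > \sqrt{K/n}}\frac{C_1(\alpha,\psi)}{\Delta_i}\log_+\left(\frac{n}{H_i}\right) \le \sqrt{Kn} + C_1(\alpha,\psi)\sqrt{Kn}, \]

where the last line follows by substituting the definition of $H_i$ and solving the optimisation problem. Finally
set $C_2(\alpha,\psi) = 1 + C_1(\alpha,\psi)$.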
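The optimisation step can be illustrated in the special case where all suboptimal arms share the same gap. This is only a sanity check under that assumption, not the general argument. If $\Delta_i = \Delta = c\sqrt{K/n}$ for all $i \ge 2$ and some $c > 1$, then $H_i = K/\Delta^2$, so $n/H_i = c^2$ and

\[ \sum_{i=2}^K\frac{1}{\Delta_i}\log_+\left(\frac{n}{H_i}\right) = \frac{K-1}{c}\sqrt{\frac{n}{K}}\max\{1, 2\log c\} \le \sqrt{nK}\cdot\frac{\max\{1, 2\log c\}}{c} \le \sqrt{nK}, \]

where the last inequality uses $1/c < 1$ and $2\log(c)/c \le 2/e < 1$ for $c > 1$.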


5 Brief Experiments 

The graph on the right teases the worst-case performance
of OCUCB relative to UCB and Thompson Sampling when
$n = 10^4$, $K = 2$ and where $\Delta_2$ is varied. Precise details
are given in Appendix H where OCUCB is comprehensively
evaluated in a variety of regimes and compared to
many strategies including MOSS, AOCUCB and the finite-horizon
Gittins index strategy.
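A small simulation harness of the kind such a comparison needs might look as follows. It is a sketch under explicit assumptions and does not reproduce the protocol of Appendix H: the function names are hypothetical, the noise is standard Gaussian, the default `psi=3.0` is an illustrative choice, and `ucb_bonus` is a generic $\sqrt{\alpha\log(t)/T_i(t)}$ baseline that is not necessarily the UCB variant used in the paper's experiments.

```python
import math
import random


def pseudo_regret(true_means, n, bonus, seed=0):
    """Run an index policy for n steps and return sum_i gap_i * T_i(n + 1)."""
    rng = random.Random(seed)
    K = len(true_means)
    sums, counts = [0.0] * K, [0] * K
    for t in range(1, n + 1):
        if t <= K:
            arm = t - 1  # choose each arm once
        else:
            arm = max(range(K),
                      key=lambda i: sums[i] / counts[i] + bonus(counts[i], t, n))
        sums[arm] += rng.gauss(true_means[arm], 1.0)  # unit-variance Gaussian reward
        counts[arm] += 1
    best = max(true_means)
    return sum((best - m) * c for m, c in zip(true_means, counts))


def ocucb_bonus(pulls, t, n, alpha=3.0, psi=3.0):
    """Exploration bonus of Algorithm 1."""
    return math.sqrt(alpha / pulls * math.log(psi * n / t))


def ucb_bonus(pulls, t, n, alpha=3.0):
    """Generic UCB-style bonus (assumed baseline for illustration)."""
    return math.sqrt(alpha / pulls * math.log(t))
```

Calling `pseudo_regret([0.5, 0.0], 10**4, ocucb_bonus)` and the same with `ucb_bonus` for a range of gaps, averaged over several seeds, gives a rough picture of how the two indices trade off exploration.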

6 Conclusions 



The Optimally Confident UCB algorithm is the first algorithm that simultaneously enjoys order-optimal
problem-dependent and worst-case regret guarantees. The algorithm is simple, extremely efficient (see
Appendix E) and empirically superb (see Appendix H). The main conceptual contribution is a greater
understanding of how to optimally select the confidence level when designing optimistic algorithms for solving
the exploration/exploitation trade-off. There are some open problems.

Improving Analysis and Constants. Much effort has been made to maximise the region of the parameters
$\alpha$ and $\psi$ for which order-optimal regret is guaranteed. Unfortunately the empirical choices are not supported
by minimising the regret bound with respect to $\alpha$ and $\psi$ (which in any case would be a herculean task). The
open problem is to derive a simple proof of the main theorems for which the theoretically optimal $\alpha$ and $\psi$
are also practical. Along the way it should be possible to modify the index to show exact asymptotic
optimality. My presumption is that this can be done by setting $\alpha = 2$ and adding an additional $O(\log\log t)$
bonus as in KL-UCB.

Anytime Algorithms. The new algorithm is not anytime because it requires knowledge of the horizon in 
advance (MOSS is also not anytime, but Thompson sampling is). It should be possible to apply the same 
repeated restarting idea as was used by Auer and Ortner [2010], but this is seldom practical. Instead it would 
be better to modify the algorithm to smoothly adapt to an increasing horizon. As an aside, an algorithm is not 
necessarily worse because it needs to know the horizon in advance. An occasionally reasonable alternative 
view is that such algorithms have an advantage because they can exploit available information. There may 
be cause to modify Thompson sampling (or other algorithms) so that they can also exploit a known horizon. 

Exploiting Low Variance. There is also the question of exploiting low variance when the rewards are not
Gaussian. Much work has been done in this setting, especially when the rewards are bounded (e.g., the KL-UCB
algorithm by Garivier [2011], Maillard et al. [2011], Cappe et al. [2013] or UCB-V by Audibert et al.
[2007]), but also more generally [Bubeck et al., 2013]. It is not hard to believe that some of the ideas used
in this paper extend to those settings (or vice versa). Related is the question of how to trade robustness and
expected regret. Merely increasing $\psi$ (or $\alpha$) will decrease the variance of OCUCB, while perhaps retaining
many of the positive qualities of the choice of confidence interval. A theoretical and empirical investigation
would be interesting.

Optimal Lower Bounds. The lower bound given in Appendix F is very slightly suboptimal and can likely
be strengthened by removing the log log K term. Likely the form of the statement can also be altered to
emphasise that OCUCB really is making a well-justified trade-off.

Extensions. Besides the improvement for finite-armed bandits, I am hopeful that some of the techniques 
may also be generalisable to the stochastic linear (or contextual) bandit settings for which we do not yet 
have worst-case optimal algorithms (see, for example, the work by Dani et al. [2008], Abbasi-Yadkori et al. 
[2011], Carpentier and Munos [2012], Rusmevichientong and Tsitsiklis [2010]). 
A Proof of Lemma 6

I briefly prove a maximal version of the standard concentration inequalities for i.i.d. subgaussian random
variables. The proof is totally standard and presumably has been written elsewhere, but a reference proved
elusive. Since $X_i$ is 1-subgaussian, by definition it satisfies

\[ (\forall\lambda \in \mathbb{R}) \quad \mathbb{E}[\exp(\lambda X_i)] \le \exp(\lambda^2/2). \]

Now $X_1, X_2, \ldots$ are i.i.d. and zero mean, so by convexity of the exponential function $\exp(\lambda\sum_{s=1}^t X_s)$ is a
sub-martingale. Therefore if $\varepsilon > 0$, then by Doob's maximal inequality

\[ \mathbb{P}\left\{\exists t \le n : \sum_{s=1}^t X_s \ge \varepsilon\right\} = \inf_{\lambda > 0}\mathbb{P}\left\{\exists t \le n : \exp\left(\lambda\sum_{s=1}^t X_s\right) \ge \exp(\lambda\varepsilon)\right\} \le \inf_{\lambda > 0}\exp\left(\frac{\lambda^2 n}{2} - \lambda\varepsilon\right) = \exp\left(-\frac{\varepsilon^2}{2n}\right). \tag{9} \]
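The maximal inequality in Eq. (9) can be checked empirically. The sketch below estimates the probability that a standard Gaussian random walk (one particular 1-subgaussian choice, assumed here for illustration) exceeds $\varepsilon$ at some step $t \le n$ and compares it with $\exp(-\varepsilon^2/(2n))$. The function name and default parameters are hypothetical.

```python
import math
import random


def maximal_inequality_check(n=50, eps=15.0, trials=2000, seed=0):
    """Return (empirical frequency, bound) for P{exists t <= n: sum_{s<=t} X_s >= eps}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        partial_sum = 0.0
        for _ in range(n):
            partial_sum += rng.gauss(0.0, 1.0)
            if partial_sum >= eps:  # the walk crossed the level at some t <= n
                hits += 1
                break
    return hits / trials, math.exp(-eps ** 2 / (2 * n))
```

The bound is not tight for Gaussian noise, so the empirical frequency is expected to sit noticeably below it.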


Now we use the peeling argument. 
where (a) follows from the union bound, (b) follows from the definition of ft, (c) follows since $\delta_t$ is non-decreasing,
(d) since A > 0 and t > (e) from the maximal inequality Eq. (9) and (f) since $\delta_t \le 1/2$ for
all $t$. Let

\[ k^* = \min\left\{k : \gamma^{k+1} \ge \frac{1}{\Delta^2}\log\frac{1}{\delta_\Delta}\right\}. \]

Then $\gamma^{k+1} \le u_\Delta$ for all $k < k^*$ and so

OO 
B Proof of Regularity Lemmas


I make use of the following version of Chernoff’s bound. 

Lemma 16 (Chernoff Bound). Let $X_1, \ldots, X_n$ be independent Bernoulli random variables with $\mathbb{E}X_i \le \mu$.
Then 

_ 2 ' 


>^ + £ | < eX P 


ne 

3/r 


Lemma 9. For fc € {0,1,...} let Sk = {2 k , ..., min { K , 2 fc+1 — l}}. Define /c max = min {k : J\ £ S^}, 
which means for k < k mSiX we have \Sk\ = 2 fc and Ufc=cT Sfc = {1, - - -, IF}. Define S}. )( g =1 = {i G Sk ■ Pi ,a = 1}. 
Then by Chemoff’s bound (Lemma 16) and the union bound 
where (a) is trivial, (b) since m is non-increasing, (c) since |S*| = 2 k and S k ,p= \1 > 2 k 1 , (d) since 
|5 fc+ i| < 2 k+1 , (e) since m is non-increasing, (f) since U*<fc max &k +1 = { 2 ,... ,K}. Finally note that 
Pi, A = 1, since |*S'o,/3>i| < 1 / 2 . Therefore 5 J2i-.Pi A =i min { U u P J2k=l min { u ii T }• □ 

Lemma 10. We make a similar argument as above. Let S k and k m;ix be as in the proof of Lemma 9 and 
Sk,/3 = {i £ Sf. : > P}. Then by Lemma 6 and Chernoff’s bound we have 

P { 3k € {0,1,..., k max } and 0 € {2,3,...} : \S k ,p\ > \S k \2~^ A } 

oo oo oo 

<£ 3S A 2-W-'+ •£■£ 35 A 2- k 2~P / 2 < 135 a • 

/3=2 k=0P=2 


Now assume that 15*^1 < 15*12 for all k G {0,1,..., fc max } and /3 € {2, 3,...}. Then 
67 ^ min {u t , T} 
where (a) is trivial, (b) from the definition of S k ,p, (c) since we have assumed that \S k ,p\ < |5*|2 _/3 / 4 and 
by the monotonicity of m, (d) by evaluating the (almost) geometric series, (e) since 5*j < for all 

k > 1 , (f) since m is non-increasing and (g) is trivial. □ 


C Technical Lemmas 


5 /\2 

Lemma 17. If A > A*, then 

5a A j 
Proof. The result follows from the definition of $\delta_\Delta$, and the fact that


K K 

min {ui, Uj} = min 
3 =i i=t 
log
where the last inequality follows since $u_i \ge u_\Delta$ and so $\delta_{u_i} \ge \delta_{u_\Delta}$.

Lemma 18. If A < A,;, then < —f. 

m A 2 

Proof Since A < A t we have « a > Ui and so <5 a > <5 Ui . Therefore 

'«A _ a* log (si) < A? 
A 2 log “ A 2 


as required. 


log 
D Constants and Constraints


Here we analyse the various constants and corresponding constraints used in the proof of Theorem 1 . We 
have the following constraints. 


c 6 = 134c 2 c|/V’ 
c 7 = lO^cg (ci + c 2 ) 

oo 

% = E 23 ~ 7 ‘ 

k =0 
c 4 = v /27 
_ 247c 6 c 7 
° 5 n (7 - 1 ) 
il> > 2 

(2 + c 3 )c 4 /c 8 < 1/4 
7 € ( 1 , a/ 2 ). 
N log 1

UA 

/c 2 \ 

Ua ) 

\/S log| 
(Const9)
(Const 10) 
(Const 11) 
(Constl2) 

(Constl3) 

(Const 14) 


$c_{11} = 2c_5c_8(1 + \log(c_8)) + c_5(3c_8\log(c_8) + 5)$. (Const15)


Satisfying the Constraints 

First we satisfy (Const9) by choosing 


c 2 


( y/a + ^2g \ 2 


which by the assumption that $\gamma < \alpha/2$ is finite. We observe that Eq. (Const11-Const14) are satisfied by
choosing


ua > max 


16 • 27c 2 
A 2 


log 



16a 

"a 2 " 


log 



2 7 ci 

c|c|A 2 bg 



For the sake of simplicity we will be conservative by choosing 


eg , 

“A = ^7 lo § 



C 9 


max 


| 327 c 2 , 16a, 


27 C 1 ) 

clcii 


C10 = max{c 2 , c 7 ip, ci'ip} . 


Then (Const 10) can be satisfied by choosing c\ and c 3 sufficiently large. So increasing c 4 increases ua, but 
the latter dependence is logarithmic, which means that for sufficiently large c 4 the relation will be satisfied. 
Now (Const7) can be satisfied by choosing $c_8$ sufficiently large. Now (Const8) can be satisfied provided
$\alpha > 2$, which we assumed in both Theorem 1 and Theorem 2.
E Computation Time


A naive implementation of the Optimally Confident UCB algorithm requires $O(K)$ computation per time
step. For large $K$ it is possible to obtain a significant performance gain by noting that the index of unplayed
arms is strictly decreasing, which means the algorithm only needs to re-sort the arms for which the index at
the time of last play exceeds the index of the previously played arm. The running time of this algorithm over
$n$ time steps is $O(n)$ in expectation (asymptotically). This observation also applies to MOSS, for which the
index of unplayed arms does not change with time at all.$^3$ In contrast, for Thompson sampling it seems that
sampling all arms at every time step is essentially unavoidable without significantly changing the algorithm.
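
The observation for MOSS can be sketched in Python with a binary heap. This is a minimal sketch and not the code used in the experiments: the names `moss_index` and `moss_heap`, the `pull` callback, and the constant 2 inside the square root are assumptions made for illustration. Since only the played arm's index changes, one pop and one push per step are enough.

```python
import heapq
import math


def moss_index(mean, pulls, n, K):
    """MOSS index; the constant 2 inside the square root is an assumption."""
    return mean + math.sqrt(2.0 / pulls * math.log(max(1.0, n / (K * pulls))))


def moss_heap(pull, K, n):
    """Run MOSS for n >= K steps; pull(i) returns a reward for arm i.

    Returns the list of chosen arms (numbered from 0). Each step after the
    initial round costs one heap pop and one heap push, i.e. O(log K).
    """
    sums = [0.0] * K
    pulls = [0] * K
    choices = []
    heap = []
    # Initial round: play each arm once and store its index.
    for i in range(K):
        sums[i] += pull(i)
        pulls[i] = 1
        choices.append(i)
        # heapq is a min-heap, so the negated index is stored.
        heapq.heappush(heap, (-moss_index(sums[i], 1, n, K), i))
    for _ in range(K, n):
        _, i = heapq.heappop(heap)  # arm with the largest index
        sums[i] += pull(i)
        pulls[i] += 1
        choices.append(i)
        # Only the played arm's index has changed, so only it is re-inserted.
        heapq.heappush(heap, (-moss_index(sums[i] / pulls[i], pulls[i], n, K), i))
    return choices
```

Ties are broken in favour of the arm with the smallest number, which is a side effect of storing `(negated index, arm)` tuples rather than a property of MOSS itself.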


F Lower Bounds 


Throughout this section I consider a single fixed and arbitrary policy $\pi$. Starting with a simple case, let
$\Delta > 0$ and define $\mu^i \in \mathbb{R}^K$ for $i \in \{1, \ldots, K\}$ by

$$\mu^i_k = \begin{cases} \Delta & \text{if } k = 1 \\ 2\Delta & \text{if } k = i \\ 0 & \text{otherwise.} \end{cases}$$

Let $\mathbb{E}_i$ denote the expectation with respect to the measure on outcomes induced by the combination of the
fixed strategy with environment $\mu^i$ and standard Gaussian noise. Let $\mathbb{P}_i$ be the corresponding measure.

Theorem 19. Assume $H = (K - 1)/\Delta^2 \le n/e$. Then there exists an $i$ such that

$$R^\pi_{\mu^i}(n) \ge \frac{1}{4} \cdot \frac{K - 1}{\Delta} \log\left(\frac{n}{H}\right).$$

Remark 20. Up to constant factors $H$ coincides with $H_j$ for all $j$ and all reward vectors $\mu^i$, so the theorem
implies the upper bound in Theorem 1 is tight for at least one of the reward vectors $\mu^i$.

Proof of Theorem 19. Let $A_i = \mathbb{1}\{T_i(n + 1) \ge n/2\}$ be the event that the $i$th arm is chosen at least $n/2$ times.
Suppose that


(3* > 1) 






Then an application of Lemma 2.6 by Tsybakov [2008] leads to 


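$$\mathbb{P}_1\{A_i\} + \mathbb{P}_i\{A_i^c\} \ge \exp\left(-\mathrm{KL}(\mathbb{P}_1, \mathbb{P}_i)\right) = \exp\left(-2\Delta^2 \mathbb{E}_1 T_i(n + 1)\right)$$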



( 10 ) 


Therefore 


Rpi ( n ) + ( n ) ^ 


nA H 


n 


log 


K — 1 


A 


log 


$^3$ This property makes it trivial to implement MOSS in $O(\log K)$ per time step using a priority queue, but for long horizons one
should expect even better performance.
If Eq. (10) does not hold, then

$$R^\pi_{\mu^1}(n) \ge \frac{K - 1}{2\Delta} \log\left(\frac{n}{H \log(n/H)}\right) \ge \frac{1}{4} \cdot \frac{K - 1}{\Delta} \log\left(\frac{n}{H}\right),$$

where the last inequality follows since $\log(x/\log(x)) \ge \log(x)/2$ for $x \ge e$. Therefore we conclude that
there exists an $i$ such that

$$R^\pi_{\mu^i}(n) \ge \frac{1}{4} \cdot \frac{K - 1}{\Delta} \log\left(\frac{n}{H}\right)$$

as required. □
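
As a numerical illustration, the following Python sketch evaluates the lower bound and spot-checks the logarithmic inequality used in the last step. It assumes the bound reads $\frac{1}{4} \cdot \frac{K-1}{\Delta} \log(n/H)$ with $H = (K-1)/\Delta^2$; the function names `theorem19_bound` and `log_inequality_holds` are hypothetical and introduced only for this example.

```python
import math


def theorem19_bound(K, delta, n):
    """Evaluate (1/4) * (K - 1) / delta * log(n / H) with H = (K - 1) / delta**2."""
    H = (K - 1) / delta ** 2
    if H > n / math.e:
        raise ValueError("Theorem 19 assumes H <= n / e")
    return 0.25 * (K - 1) / delta * math.log(n / H)


def log_inequality_holds(x):
    """Check log(x / log(x)) >= log(x) / 2, which the proof uses for x >= e."""
    return math.log(x / math.log(x)) >= math.log(x) / 2


# Example: K = 10 arms, gap 3/10 and horizon 10^4, so H is about 100.
bound = theorem19_bound(10, 0.3, 10 ** 4)
```

The check is only a spot test at chosen points. The inequality itself is equivalent to $\sqrt{x} \ge \log(x)$, which holds for all $x > 0$.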


The lower bound matches the upper bound for this problem given in Theorem 1. It should be emphasised 
that if Eq. (10) does not hold by a largish margin, then the penalty in environment i is enormous relative to 
the logarithmic penalty of exploring arm i, which means that in some sense it is optimal to explore arm i 
such that 

$$\mathbb{E}_1[T_i(n + 1)] \in \Omega\left(\frac{1}{\Delta^2} \log\left(\frac{n}{H}\right)\right).$$


Unbalanced reward vector. For the case that $\mu$ is arbitrary and $\mu^i_j = \mu_j + 2\Delta_i \mathbb{1}\{i = j\}$ it is possible to
show that there exists an $i$ such that

( \\ 

1 n 

v l0gA E M * mhl {(AlTA-) 2 ’ 

This matches the upper bound given in Theorem 1 except for the extraneous log K in the denominator of the 
logarithm. I believe the upper bound is tight, which is corroborated in certain cases including the uniform 
case explored in Theorem 19 and the highly non-uniform case discussed in Section 2.1. The omitted proof 
of the above result is an algebraic mess, but follows along the same lines as Theorem 19. 


R^n) € n 


£ 

3^ 


1 


A,- + A s 


log 


G Almost Optimally Confident UCB 

Here I present a practical and less aggressive version of Algorithm 1 that manages the same regret as
improved UCB by Auer and Ortner [2010]. While the regret guarantees are not quite optimal, the proof is so
straightforward it would be remiss not to include it.


Input: $K$, $n$

Choose each arm once
for $t \in K + 1, \ldots, n$ do

    Choose $I_t = \arg\max_i \hat\mu_i(t) + \sqrt{\frac{2}{T_i(t)} \log\left(\frac{n}{T_i(t)}\right)}$

end


Algorithm 2: Almost Optimally Confident UCB
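
A minimal Python sketch of Algorithm 2 follows. The exploration bonus $\sqrt{(2/T_i(t)) \log(n/T_i(t))}$ is an assumption, based on the constant 2 discussed in the remarks after Theorem 21 and on the terms that appear in its proof. The function name `aocucb` and the unit-variance Gaussian reward simulator are likewise illustrative and not part of the algorithm.

```python
import math
import random


def aocucb(means, n, seed=0):
    """Simulate the index policy on Gaussian arms; returns pull counts per arm."""
    rng = random.Random(seed)
    K = len(means)
    sums = [0.0] * K
    pulls = [0] * K

    def play(i):
        sums[i] += rng.gauss(means[i], 1.0)  # reward with standard Gaussian noise
        pulls[i] += 1

    for i in range(K):  # choose each arm once
        play(i)
    for _ in range(K, n):
        # Assumed bonus sqrt((2 / T_i) * log(n / T_i)); T_i <= n keeps the log >= 0.
        best = max(
            range(K),
            key=lambda i: sums[i] / pulls[i]
            + math.sqrt(2.0 / pulls[i] * math.log(n / pulls[i])),
        )
        play(best)
    return pulls
```

For example, `aocucb([0.0, -1.0], 2000)` returns the number of times each of two arms with gap 1 was played over 2000 steps.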
Theorem 21. There exist universal constants $C_3$ and $C_4$ such that for all $\delta > 0$,

$$R^{\text{aocucb}}(n) \le n\delta + \sum_{i : \Delta_i > \delta} \frac{C_3}{\Delta_i} \log_+(n\Delta_i^2) \quad \text{and} \quad R^{\text{aocucb}}(n) \le C_4 \sqrt{nK \log K}.$$

Some remarks before the proof: 

• The constant appearing inside the square root is the smallest known for a UCB-style algorithm with
finite-time guarantees. Other algorithms require at least $2 + \varepsilon$ with arbitrary $\varepsilon > 0$, but with a bound
that tends to infinity as $\varepsilon$ becomes small. There are asymptotic results when the constant is 2 by
Katehakis and Robbins [1995], which leaves open the possibility for improved analysis.

• The algorithm is strictly less aggressive than both MOSS and OCUCB, which eases the analysis and
saves it from the poor problem-dependent regret of MOSS.

Proof of Theorem 21. I write $f(\cdot) \lesssim g(\cdot)$ if there is a universal constant $c$ such that $f(\cdot) \le c \cdot g(\cdot)$. First I note that
for any $\Delta > 0$


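$$\mathbb{P}\left\{\exists t \le n : \hat\mu_1(t) + \sqrt{\frac{2}{T_1(t)} \log\left(\frac{n}{T_1(t)}\right)} \le \mu_1 - \Delta\right\} \lesssim \frac{1}{n\Delta^2} \log_+(n\Delta^2).$$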


The proof of this claim follows from a peeling device on a geometric grid with parameter $\gamma$ that must then
be optimised. Let


A = min — (i\ (t) — ^ 
For each sub-optimal arm i define a stopping time 

f I ‘ 

n = min it : p,i(t) + 


■log 


Ti(t) 


n 


t < n 


■log 


n 


< //i + Aj/2 


, Ti(t) 

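This is essentially identical to that used by Audibert and Bubeck [2009] where it is shown that

$$\mathbb{E}[T_i(\tau_i)] \lesssim \frac{1}{\Delta_i^2} \log_+(n\Delta_i^2).$$

Now if $\Delta \le \Delta_i/2$, then $T_i(n + 1) \le T_i(\tau_i)$. Therefore

$$\mathbb{E} T_i(n + 1) \le \mathbb{E}\left[T_i(\tau_i) \mathbb{1}\left\{\Delta \le \frac{\Delta_i}{2}\right\} + n \mathbb{1}\left\{\Delta > \frac{\Delta_i}{2}\right\}\right] \lesssim \frac{1}{\Delta_i^2} \log_+(n\Delta_i^2).$$

The result follows by bounding $\sum_{i : \Delta_i \le \delta} T_i(n + 1) \le n$.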
H Experiments


Before the experiments some book-keeping. All code will be made available in any final version. Error
bars depict two standard errors and are omitted when they are too small to see. Each data-point is an
estimate based on $N$ i.i.d. samples, where $N$ is given in the title of each plot. The noise model is a standard Gaussian
in all experiments ($\eta_t \sim \mathcal{N}(0, 1)$). I compare the new algorithm with a variety of algorithms in different
regimes. First though I evaluate the sensitivity of OCUCB to the main parameter $\alpha$ in two key regimes. The
first is when the horizon is fixed to $n = 10^4$ and there is a single optimal arm and $\Delta_i = \Delta$ for all suboptimal
arms. The second regime is like the first, but $\Delta = 3/10$ is fixed and $n$ is varied. The results (see Fig. 1)
unsurprisingly show that the optimal $\alpha$ is problem specific, but that $\alpha \in [2, 3]$ is a reasonable choice in all
regimes. In general, a large $\alpha$ leads to better performance when $n$ is small while small $\alpha$ is better when
$\Delta$ is large. This is consistent with the intuition that large $\alpha$ makes the algorithm more conservative. The
dependence on $\psi$ is very weak (results are omitted).
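
The book-keeping for a single data-point can be sketched in Python as follows. This is an illustration only: the helper name `mean_and_two_se` is hypothetical, and the use of the unbiased sample variance is an assumption, since the estimator is not specified here.

```python
import math


def mean_and_two_se(samples):
    """Return the sample mean and two standard errors of the mean.

    `samples` holds N >= 2 independent regret estimates for one data-point.
    """
    N = len(samples)
    mean = sum(samples) / N
    var = sum((x - mean) ** 2 for x in samples) / (N - 1)  # unbiased variance
    return mean, 2.0 * math.sqrt(var / N)
```

An error bar then spans `mean - two_se` to `mean + two_se`, where `two_se` is the second returned value.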


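[Plots omitted. Panel titles: $n = 10^4$ and $K = 2$ and $N = 2.3 \times 10^5$ and $\Delta$ varies; $n = 10^4$ and $K = 10$ and $N = 1.2 \times 10^5$ and $\Delta$ varies. The remaining panels have $n$ on the horizontal axis.]

Figure 1: Parameter sensitivity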


Comparison to Other Algorithms 


I compare OCUCB and AOCUCB against UCB, MOSS, Thompson Sampling with a flat Gaussian prior and
the near-Bayesian finite-horizon Gittins index strategy. For OCUCB I used $\alpha = 3$ in all experiments, which
was chosen based on the experiments in the previous section. For UCB and MOSS I used the following
indexes:

$$I_t^{\text{ucb}} = \arg\max_i \hat\mu_i(t) + \sqrt{\frac{2}{T_i(t)} \log t}$$

$$I_t^{\text{moss}} = \arg\max_i \hat\mu_i(t) + \sqrt{\frac{2}{T_i(t)} \log\max\left\{1, \frac{n}{K T_i(t)}\right\}}$$



This version of UCB is asymptotically optimal [Katehakis and Robbins, 1995], but finite-time results are
unknown as far as I am aware. In practice the 2 inside the square root is uniformly better than any larger
constant. The analysis of Audibert and Bubeck [2009] would suggest using a constant of 4 for MOSS, but 2
led to improved empirical results and in fact the theoretical argument can be improved to allow $2 + \varepsilon$ for any
$\varepsilon > 0$ using the same arguments as this paper. The finite-horizon Gittins index strategy is a near-Bayesian
strategy introduced for one-armed bandits by Bradt et al. [1956] and suggested as an approximation of the
Bayesian strategy in the general case by Nino-Mora [2011] and possibly others. For Bernoulli noise it was
shown to have excellent empirical performance by Kaufmann et al. [2012a] while theoretical and empirical
results were recently given for the Gaussian case by Lattimore [2015]. The Gittins index strategy is not
practical computationally for horizons larger than $n = 10^4$, so is omitted from the large-horizon plots.
For Thompson sampling I used the flat Gaussian prior, which means that $I_t = t$ for $t \in \{1, \ldots, K\}$ and
thereafter

$$I_t^{\text{thomp. samp.}} = \arg\max_i \hat\mu_i(t) + \eta_i(t),$$

where $\eta_i(t) \sim \mathcal{N}(0, 1/T_i(t))$. In the first set of experiments I use the same regimes as Fig. 1.
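
The Thompson sampling rule above can be sketched in Python as follows. This is a minimal sketch rather than the code used for the plots: the function name `thompson_flat_gaussian`, the zero-based arm numbering and the pseudo-regret accounting are assumptions made for illustration.

```python
import math
import random


def thompson_flat_gaussian(means, n, seed=0):
    """Thompson sampling with a flat Gaussian prior on unit-variance Gaussian arms.

    Returns the pseudo-regret accumulated over n steps.
    """
    rng = random.Random(seed)
    K = len(means)
    sums = [0.0] * K
    pulls = [0] * K
    best_mean = max(means)
    regret = 0.0
    for t in range(n):
        if t < K:
            arm = t  # each arm is played once in the first K steps
        else:
            # Sample mean plus N(0, 1 / T_i) noise; gauss takes a standard
            # deviation, hence 1 / sqrt(T_i).
            arm = max(
                range(K),
                key=lambda i: rng.gauss(sums[i] / pulls[i], 1.0 / math.sqrt(pulls[i])),
            )
        sums[arm] += rng.gauss(means[arm], 1.0)
        pulls[arm] += 1
        regret += best_mean - means[arm]
    return regret
```

Every arm is sampled at every step, which is the $O(K)$ per-step cost that makes the method slow when $K$ is large.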


[Plots omitted. Panel titles: $n = 10^4$ and $K = 2$ and $N = 1.2 \times 10^5$ and $\Delta$ varies; $n = 10^4$ and $K = 10$ and $N = 3 \times 10^4$ and $\Delta$ varies; $\Delta = 3/10$ and $K = 2$ and $N = 2 \times 10^6$ and $n$ varies.]

Figure 2: Regret comparison


The results show that OCUCB is always competitive with the best and sometimes significantly better. 
Arguably the Gittins strategy is the winner for small horizons, but its computation is impractical for large 
horizons. MOSS is also competitive in these regimes, which is consistent with the theory (in the next section 
we see where things go wrong for MOSS). Thompson sampling and AOCUCB are almost indistinguishable. 
Failure of MOSS


The following experiment highlights the poor problem-dependent performance of MOSS relative to OCUCB. 
The experiment uses 

1 o 

u\ — 0 112 = ——7 m = — 1 for all i > 2 n = K . 

4A 

The results are plotted for increasing $K$ and algorithms OCUCB/MOSS. Algorithms like Thompson
sampling for which the running time is $O(Kn)$ are too slow to evaluate in this regime for large $K$. As the theory
predicts, the regret of MOSS is exploding for large $K$, while OCUCB enjoys good performance. Curiously
the issues are only serious when $K$ (and so $n$) is unreasonably large. In modestly sized experiments MOSS
is usually only slightly worse than OCUCB.


[Plot omitted. Title: $N = 600$; horizontal axis: $K$.]

Figure 3: Failure of MOSS


Uniformly Distributed Arms 

In the final experiment I set $\mu_i = -(i - 1)/K$ for all $i$ and vary $n$ with $K \in \{10, 100\}$. As in previous
experiments we see OCUCB and MOSS leading the pack with Thompson sampling and AOCUCB almost
identical and UCB significantly worse.
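
For concreteness, the reward vector of this experiment and the associated hardness quantities $H$ and $H_i$ from the table of notation can be computed as in the Python sketch below. The helper names `uniform_arm_means` and `hardness` are hypothetical. Treating the inverse squared gap of the optimal arm as infinite is a convention adopted here, so that it never attains the minimum inside $H_i$.

```python
import math


def uniform_arm_means(K):
    """Means mu_i = -(i - 1) / K for i = 1, ..., K."""
    return [-(i - 1) / K for i in range(1, K + 1)]


def hardness(means):
    """Return (H, H_by_arm), where H_by_arm maps each suboptimal arm to H_i.

    Arms are numbered from 1. H sums the inverse squared gaps of the
    suboptimal arms; H_i sums min(1 / gap_i^2, 1 / gap_j^2) over all arms j.
    """
    best = max(means)
    inv_sq = [math.inf if m == best else 1.0 / (best - m) ** 2 for m in means]
    H = sum(v for v in inv_sq if v != math.inf)
    H_by_arm = {
        i + 1: sum(min(v, w) for w in inv_sq)
        for i, v in enumerate(inv_sq)
        if v != math.inf
    }
    return H, H_by_arm
```

For instance, `hardness(uniform_arm_means(10))` gives the quantities for the $K = 10$ panel.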


[Plots omitted. Panel titles: $K = 10$ and $N = 4.3 \times 10^4$ and $n$ varies; $K = 100$ and $N = 7.2 \times 10^3$ and $n$ varies.]

Figure 4: Uniformly distributed arms
I Table of Notation


$K$                             number of arms
$n$                             horizon
$t$                             current time step
$\mu_i$                         expected return of arm $i$
$\hat\mu_{i,s}$                 empirical estimate of return of arm $i$ based on $s$ samples
$\hat\mu_i(t)$                  empirical estimate of return of arm $i$ at time step $t$
$\Delta_i$                      gap between the expected returns of the best arm and the $i$th arm
$\Delta_{\min}$                 minimum gap, $\Delta_2$
$\log_+(x)$                     $\max\{1, \log(x)\}$
$H$                             $\sum_{i=2}^K \Delta_i^{-2}$
$H_i$                           $\sum_{j=1}^K \min\{\Delta_i^{-2}, \Delta_j^{-2}\}$
$c_\gamma, c_1, \ldots, c_{11}$ non-negative constants (see Appendix D)
5t, 5 a                         see Eq. (3)
5a                              see Eq. (4)
$u_i$                           number of samples that we expect to choose suboptimal arm $i$ (see Eq. (3))
$\gamma$                        ratio used in peeling argument
A,A                             definition given in Eq. (5)
$\alpha$, $\psi$                parameters used by Algorithm 1
Fa                              see Eq. (6)
A                               see Eq. (7)